# Large Language Model Integration

## VideoRefer-7B

Organization: DAMO-NLP-SG · License: Apache-2.0 · Tags: Text-to-Video, Transformers, English

VideoRefer-7B is a multimodal large language model focused on video question answering; it understands and reasons about spatiotemporal relationships between objects in a video. (A frame-sampling sketch that applies to both video models in this list follows the VideoLLaMA 2 entry.)
## VideoLLaMA2-8x7B-Base

Organization: DAMO-NLP-SG · License: Apache-2.0 · Tags: Text-to-Video, Transformers, English

VideoLLaMA 2 is a next-generation video large language model focused on stronger spatiotemporal modeling and audio understanding; it supports multimodal video question answering and video description.
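Neither VideoRefer nor VideoLLaMA 2 is served by a built-in transformers class; inference typically goes through the DAMO-NLP-SG GitHub repositories. The sketch below shows only the generic preprocessing step both share, uniform frame sampling with OpenCV; `answer_video_question` is a hypothetical stand-in for whichever inference call the chosen repository actually provides, and `demo.mp4` is a placeholder path.

```python
import cv2  # pip install opencv-python


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    step = (total - 1) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
        ok, frame_bgr = cap.read()
        if ok:
            # OpenCV decodes to BGR; vision models generally expect RGB.
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


frames = sample_frames("demo.mp4", num_frames=8)
# Hypothetical call into the model repo's inference API:
# print(answer_video_question(frames, "What is the person on the left doing?"))
```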
## BLIP-2 OPT-6.7b

Organization: merve · License: MIT · Tags: Image-to-Text, Transformers, English

BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text generation and visual question answering.
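The quickest way to try a BLIP-2 checkpoint is the built-in transformers `image-to-text` pipeline. A minimal sketch, assuming the upstream Salesforce checkpoint id (this listing's copy may differ) and enough memory for the full-precision weights:

```python
from transformers import pipeline

# "image-to-text" is a built-in pipeline task; the checkpoint id is the
# upstream Salesforce release, assumed here for illustration.
captioner = pipeline("image-to-text", model="Salesforce/blip2-opt-6.7b")

result = captioner("photo.jpg")  # local path or URL; placeholder here
print(result[0]["generated_text"])
```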
## Heron Preliminary GIT Llama-2 70B v0

Organization: turing-motors · Tags: Image-to-Text, Transformers, Japanese

A vision-language model pretrained on image-text pairs, built on the Llama-2 70B architecture and suited to image caption generation.
## BLIP-2 OPT-6.7b (8-bit)

Organization: Mediocreatmybest · License: MIT · Tags: Image-to-Text, Transformers, English

An 8-bit quantized variant of BLIP-2: a vision-language model that combines an image encoder with the OPT-6.7b large language model for image-to-text generation.
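A pre-quantized checkpoint saves download size, but a similar memory footprint can also be reached by quantizing full-precision weights at load time with bitsandbytes. A minimal sketch, assuming the upstream Salesforce checkpoint id, the bitsandbytes package, and a CUDA GPU:

```python
from transformers import (
    BitsAndBytesConfig,
    Blip2ForConditionalGeneration,
    Blip2Processor,
)

# Quantize the weights to 8-bit on the fly (requires bitsandbytes + CUDA).
quant_config = BitsAndBytesConfig(load_in_8bit=True)

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b",
    quantization_config=quant_config,
    device_map="auto",  # places layers across available GPUs/CPU
)
```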
## IDEFICS-80B

Organization: HuggingFaceM4 · License: Other · Tags: Image-to-Text, Transformers, English

IDEFICS-80B is an 80-billion-parameter multimodal model that accepts interleaved image and text inputs and generates text outputs. It is an open-source reproduction of DeepMind's Flamingo model.
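IDEFICS is supported in transformers through `IdeficsForVisionText2Text`, and a prompt is a list that interleaves images with text. A minimal sketch; the image path is a placeholder, and at 80B parameters the weights need multiple GPUs or offloading, which `device_map="auto"` arranges:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-80b"
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

# Each prompt interleaves images (PIL images or URLs) with text.
image = Image.open("photo.jpg")  # placeholder path
prompts = [[image, "In this picture, we can see"]]

inputs = processor(prompts, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```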
## BLIP-2 OPT-6.7b

Organization: Salesforce · License: MIT · Tags: Image-to-Text, Transformers, English

BLIP-2 is a vision-language model based on OPT-6.7b, pretrained with both the image encoder and the language model kept frozen; it supports image-to-text generation and visual question answering.
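For visual question answering, the BLIP-2 OPT checkpoints are prompted with a "Question: ... Answer:" template alongside the image. A minimal sketch using the explicit processor and model classes; the image path is a placeholder:

```python
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")  # placeholder path
prompt = "Question: how many dogs are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```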